Abstract. Machine learning methods are being increasingly adopted in behavioral
research. Lasso regression performs variable selection and regularization,
and is particularly appealing to behavioral researchers because of its connection 
to linear regression. Researchers may expect properties of linear regression to 
translate to lasso, but we demonstrate that this assumption is problematic for mod- 
els with categorical predictors. Specifically, we demonstrate that while the coding 
strategy used for categorical predictors does not impact the performance of linear 
regression, it does impact lasso’s performance. Group lasso is an alternative to 
lasso for models with categorical predictors. We investigate the discrepancy be- 
tween lasso and group lasso models using a real data set: lasso performs different 
variable selection and has different prediction accuracy depending on the coding 
strategy, while group lasso performs consistent variable selection but has different 
prediction accuracy. Using a Monte Carlo simulation, we demonstrate a specific 
case where group lasso tends to include many variables when few are needed, 
leading to overfitting. We conclude with recommended solutions to this issue and 
future directions of exploration to improve the implementation of machine learn- 
ing approaches in behavioral science. This project shows that when using lasso 
and group lasso with categorical predictors, the choice of coding strategy should 
not be ignored. 
1 Introduction 


Many behavioral research questions involve categorical predictors, including education,
ethnicity, religion, gender, or experimental conditions. Unlike numerical predictors,
which typically have a natural scale, categorical predictors must be encoded before
they can be included in statistical models: researchers must select a method for
representing the categories using a numeric system. Thus, a single categorical predictor
can be represented in a model using different sets of variables, each set embodying the 
same predictor but representing different contrasts of the categories. This special property
of categorical predictors motivates our exploration of categorical predictors in the
case of linear regression and two machine learning algorithms: least absolute shrinkage 
and selection operator (lasso; Tibshirani, 1996) and group lasso regression (Yuan & Lin, 
2006). We explore both variable selection and prediction accuracy for these models and 
how they are impacted by using different coding strategies for categorical predictors 
using a real-world data set. 

We use a data set focusing on stress during COVID-19 as the primary outcome, 
measured in over 100,000 participants (Yamada et al., 2021). The stress score is an 
aggregated score from the Perceived Stress Scale (PSS-10) on a 1-5 scale. The data 
set includes categorical predictors, such as Education, Gender and Marital Status, and 
continuous predictors, such as Age and Trust in the Country. The overall goal is to 
predict participants’ Stress using the available predictors. 

In the remainder of this section, we introduce the three analytical approaches exam- 
ined in this paper: linear regression, lasso regression, and group lasso regression. We 
focus on the application of these methods with a continuous outcome and one or more 
categorical predictors. After introducing these methods, we demonstrate their use with 
the applied example, exploring peculiar behavior of the machine learning approaches 
that does not occur with linear regression. 


1.1 Linear Regression With Categorical Predictors 


Categorical predictors need to be encoded into a set of variables to be included 
in regression models. Different coding strategies can be implemented, such as dummy, 
contrast, sequential, or Helmert coding. Tables 1-4 show different ways to encode a categorical
variable, Education, with 7 categories (no education, up to 6 years of school,
up to 9 years of school, up to 12 years of school, some college or equivalent, college 
degree, PhD/doctorate). Dummy coding uses only 0’s and 1’s to indicate category mem- 
bership. One category is selected as the reference category (or reference group) and is 
assigned a score of 0 on all indicators. For other categories, only the indicator corre- 
sponding to the category is coded as 1 and all other indicators are set to 0 (Table 1). 
Contrast coding is similar to dummy coding, but the reference category which is coded 
as all 0 in dummy coding is now coded with all -1 instead, changing the interpreta- 
tion of the intercept and slope coefficients (Table 2). Sequential coding compares each 
category to the previous category (Table 3), while Helmert coding examines how each 
category is compared to the average of all subsequent categories (Table 4). Note that 
if a categorical variable has k categories, k - 1 indicators are needed, regardless of the
coding strategy used. This type of design matrix is called nonsingular because, together
with the intercept column, its columns are linearly independent (so the matrix $X^\top X$ is
invertible). The design matrix has to be nonsingular for linear regression
but this is not necessarily the case for lasso or group lasso. In Appendix B we discuss 
singular matrix options for lasso regression. 

In linear regression, each coding scheme represents categories using a different nu- 
merical system, which leads to different interpretations of their coefficients. However, 
each coding scheme always predicts the category mean for each category (or adjusted 
means if covariates are included), and the explained variance is the same regardless of 
coding choice (Darlington & Hayes, 2016). Therefore, researchers can choose coding
strategies among all these options according to their needs without concern about model
performance. Dummy and contrast coding are often used for nominal categorical variables,
while sequential and Helmert coding are particularly helpful when categories are
ordered.


Table 1: Dummy Coding

Education                        D1   D2   D3   D4   D5   D6
1. no education                   0    0    0    0    0    0
2. up to 6 years of school        1    0    0    0    0    0
3. up to 9 years of school        0    1    0    0    0    0
4. up to 12 years of school       0    0    1    0    0    0
5. some college or equivalent     0    0    0    1    0    0
6. college degree                 0    0    0    0    1    0
7. PhD/doctorate                  0    0    0    0    0    1

Note. No education is selected as the reference group (coded 0 on all indicators) and every other
category scores 1 on a single indicator and 0 on all other indicators.


Table 2: Contrast Coding

Education                        C1   C2   C3   C4   C5   C6
1. no education                   1    0    0    0    0    0
2. up to 6 years of school        0    1    0    0    0    0
3. up to 9 years of school        0    0    1    0    0    0
4. up to 12 years of school       0    0    0    1    0    0
5. some college or equivalent     0    0    0    0    1    0
6. college degree                 0    0    0    0    0    1
7. PhD/doctorate                 -1   -1   -1   -1   -1   -1

Note. PhD/doctorate is selected as the omitted category (coded -1 on all indicators) and every
other category scores 1 on a single indicator and 0 on all other indicators.
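To make the four schemes concrete, the following is a minimal sketch in Python with NumPy (our choice for illustration; the function names are hypothetical and are not part of the original analysis). It builds the coding matrices of Tables 1-4 for a predictor with k categories and checks that each scheme yields k - 1 indicators and a full-rank design once an intercept column is added.

```python
import numpy as np

def dummy_coding(k):
    """First category is the reference group (all zeros), as in Table 1."""
    return np.vstack([np.zeros((1, k - 1)), np.eye(k - 1)])

def contrast_coding(k):
    """Last category is the omitted group (all -1), as in Table 2."""
    return np.vstack([np.eye(k - 1), -np.ones((1, k - 1))])

def sequential_coding(k):
    """Each category scores 1 on one more indicator than the previous one (Table 3)."""
    return np.tril(np.ones((k, k - 1)), k=-1)

def helmert_coding(k):
    """Each category is contrasted with the average of all later categories (Table 4)."""
    H = np.zeros((k, k - 1))
    for j in range(k - 1):
        m = k - j                    # categories remaining from position j onward
        H[j, j] = -(m - 1) / m       # the category being contrasted
        H[j + 1:, j] = 1.0 / m       # all later categories
    return H

codings = {
    "dummy": dummy_coding(7),
    "contrast": contrast_coding(7),
    "sequential": sequential_coding(7),
    "helmert": helmert_coding(7),
}

for name, C in codings.items():
    design = np.column_stack([np.ones(7), C])   # intercept plus k - 1 indicators
    print(name, C.shape, "rank with intercept:", np.linalg.matrix_rank(design))
```

Each matrix has one row per Education category and one column per indicator, so a participant's indicator values are obtained by looking up the row for that participant's category.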

When using different coding strategies, the regression coefficients have different 
interpretations. For example, a researcher might want to know whether Stress during 
the COVID-19 pandemic can be predicted by Education. The seven categories within 
the variable Education are encoded by 6 indicators. Linear regression fits the following 
model: 

$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \beta_4 X_{4i} + \beta_5 X_{5i} + \beta_6 X_{6i} + \varepsilon_i, \tag{1}$$

where $Y_i$ is the outcome value for the $i$th observation (person), $X_{ji}$ is the $j$th variable
to convey category membership for the $i$th observation, and $\varepsilon_i$ is the error term for the
$i$th observation. Equation 1 is the general equation for all coding strategies. If different
coding strategies are used, the intercept $\beta_0$ and the coefficients for the different indicators, $\beta_1$
through $\beta_6$, have different meanings. For example, suppose the fitted linear regression
model (with $\hat{Y}_i$ representing the predicted value for the $i$th observation) is

$$\hat{Y}_i = 2 + 0.3X_{1i} + 1.5X_{2i} + 0.2X_{3i} + 0.5X_{4i} - 0.2X_{5i} - 0.4X_{6i}. \tag{2}$$


Table 3: Sequential Coding

Education                        S1   S2   S3   S4   S5   S6
1. no education                   0    0    0    0    0    0
2. up to 6 years of school        1    0    0    0    0    0
3. up to 9 years of school        1    1    0    0    0    0
4. up to 12 years of school       1    1    1    0    0    0
5. some college or equivalent     1    1    1    1    0    0
6. college degree                 1    1    1    1    1    0
7. PhD/doctorate                  1    1    1    1    1    1

Note. The lowest category scores 0 on all indicators. Each subsequent category scores 1 on one
more indicator than the previous.


Table 4: Helmert Coding

Education                        H1     H2     H3     H4     H5     H6
1. no education                 -6/7     0      0      0      0      0
2. up to 6 years of school       1/7   -5/6     0      0      0      0
3. up to 9 years of school       1/7    1/6   -4/5     0      0      0
4. up to 12 years of school      1/7    1/6    1/5   -3/4     0      0
5. some college or equivalent    1/7    1/6    1/5    1/4   -2/3     0
6. college degree                1/7    1/6    1/5    1/4    1/3   -1/2
7. PhD/doctorate                 1/7    1/6    1/5    1/4    1/3    1/2

Note. The lowest category scores -(k - 1)/k on the first indicator and 0 on all subsequent
indicators. The next highest scores 1/k on the first indicator, -(k - 2)/(k - 1) on the second
indicator, and 0 on all subsequent indicators. The next highest scores 1/k on the first indicator,
1/(k - 1) on the second indicator, -(k - 3)/(k - 2) on the third indicator, and 0 on all
subsequent indicators. And so on.


The interpretation of these coefficients would depend on which coding strategy was
used. If dummy coding was used with no education as the reference group (as in Table
1), we would interpret the coefficient for $X_4$, 0.5, as the difference between the average
stress score of individuals with some college education and the average stress score of
individuals with no education. However, if contrast coding was used (as in Table 2), 0.5 would
indicate the difference between the average stress score of individuals with up to 12 years
of school and the average score of all categories. If sequential coding was used (as in
Table 3), 0.5 would be interpreted as the difference between the average stress score of
individuals with some college education and the average stress score of individuals with
up to 12 years of school. If Helmert coding was used (as in Table 4), 0.5 would indicate
that on average individuals with up to 12 years of school are 0.5 points less stressed
than the average of those who have some college education, those who have a college
degree, and those who have a PhD/doctorate. The interpretations of the coefficients are
inseparable from the coding strategy used.
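The four readings of the coefficient 0.5 can be checked numerically. The sketch below (Python with NumPy, chosen for illustration; variable names are hypothetical) plugs the coefficients of Equation 2 into each coding matrix from Tables 1-4, computes the implied category scores, and recovers 0.5 as the contrast described for each scheme. For contrast coding, "the average score of all categories" is taken to be the unweighted average of the seven category scores.

```python
import numpy as np

k = 7
b0 = 2.0
b = np.array([0.3, 1.5, 0.2, 0.5, -0.2, -0.4])   # slopes from Equation 2

# Coding matrices laid out as in Tables 1-4 (rows = categories 1..7).
dummy = np.vstack([np.zeros((1, k - 1)), np.eye(k - 1)])
contrast = np.vstack([np.eye(k - 1), -np.ones((1, k - 1))])
sequential = np.tril(np.ones((k, k - 1)), k=-1)
helmert = np.zeros((k, k - 1))
for j in range(k - 1):
    helmert[j, j] = -(k - j - 1) / (k - j)
    helmert[j + 1:, j] = 1.0 / (k - j)

def category_scores(C):
    """Predicted score for each category implied by Equation 2 under coding C."""
    return b0 + C @ b

mu = category_scores(dummy)
diff_dummy = mu[4] - mu[0]            # some college minus no education

mu = category_scores(contrast)
diff_contrast = mu[3] - mu.mean()     # up to 12 years minus unweighted average of all categories

mu = category_scores(sequential)
diff_sequential = mu[4] - mu[3]       # some college minus up to 12 years

mu = category_scores(helmert)
diff_helmert = mu[4:].mean() - mu[3]  # average of the three later categories minus up to 12 years

print(diff_dummy, diff_contrast, diff_sequential, diff_helmert)
```

The same slope therefore answers a different substantive question under each scheme, even though the fitted equation looks identical.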

Different selections of reference categories in dummy and contrast coding and or- 
dering of categories in Helmert and sequential coding can also produce coefficients with 
different meanings. For example, if no education is the reference category for dummy 
coding, $\beta_0$ represents the average stress score for people with no education and $\beta_1$
through $\beta_6$ represent the difference between no education and the corresponding
coded category. On the other hand, if up to 6 years of school is the reference category,
$\beta_0$ represents the average stress for individuals with up to 6 years of school, and
$\beta_1$ through $\beta_6$ represent the difference between “up to 6 years of school” and the
corresponding coded category.

Though different ways to code categorical variables produce different model coef- 
ficients, they do not affect the predictions/prediction accuracy of linear regression. To 
demonstrate that linear regression with a categorical predictor will predict the same cat- 
egory means for each coding scheme, we used Education to predict Stress. We randomly 
sampled 10,000 participants from the COVID-19 Stress Data (Yamada et al., 2021) to 
serve as our sample data set, and then we randomly split our sample into training (80%) 
and test (20%) data. Next, we fit linear regression on the training data set with four 
different coding strategies from Tables 1 - 4 applied to the variable Education. Table 5 
contains the model coefficients. 


Table 5: Linear Regression Example for Coding 


Coefficient    Dummy    Contrast    Sequential    Helmert
$\beta_0$      2.852     2.955        2.852        2.955
$\beta_1$      0.031    -0.103        0.031        0.121
$\beta_2$      0.110    -0.072        0.079        0.107
$\beta_3$      0.145     0.007        0.035        0.036
$\beta_4$      0.161     0.041        0.016        0.001
$\beta_5$      0.138     0.058       -0.023       -0.022
$\beta_6$      0.139     0.035        0.001        0.001


Note. Each column of the table represents one coding strategy and rows represent the
coefficients of the indicators $X_j$ for each coding strategy.


Using the values of $X_1$-$X_6$ from Tables 1-4 and the coefficient estimates from Table
5, we reconstruct the predicted score (i.e., category mean) for the “some college or 
equivalent” category for dummy, contrast, sequential, and Helmert coding respectively. 


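$$\begin{aligned}
2.852 + 0.031(0) + 0.110(0) + 0.145(0) + 0.161(1) + 0.138(0) + 0.139(0) &= 3.013 && \text{(Dummy)} \\
2.955 - 0.103(0) - 0.072(0) + 0.007(0) + 0.041(0) + 0.058(1) + 0.035(0) &= 3.013 && \text{(Contrast)} \\
2.852 + 0.031(1) + 0.079(1) + 0.035(1) + 0.016(1) - 0.023(0) + 0.001(0) &= 3.013 && \text{(Sequential)} \\
2.955 + 0.121\left(\tfrac{1}{7}\right) + 0.107\left(\tfrac{1}{6}\right) + 0.036\left(\tfrac{1}{5}\right) + 0.001\left(\tfrac{1}{4}\right) - 0.022\left(-\tfrac{2}{3}\right) + 0.001(0) &= 3.013 && \text{(Helmert)}
\end{aligned}$$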


The predicted score for “some college or equivalent” using dummy coding is the same 
as that for contrast, sequential, and Helmert coding. Following a similar procedure, 
it can be shown that all predicted scores match the category means for each coding 
strategy (Cohen, Cohen, West, & Aiken, 2003; Darlington & Hayes, 2016). 

Since predicted scores are the same across coding strategies in linear regression, this 
means prediction accuracy is also the same across the different coding strategies. In our 
example data, prediction accuracy quantifies how far a model’s predicted stress scores 
are from the observed stress scores of participants in the test data. We use Mean Squared
Error (MSE) to measure the prediction accuracy. Mathematically, MSE is calculated as

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2, \tag{3}$$


where $n$ represents the number of observations in the test data; $Y_i$ represents the observed
outcome value of the $i$th observation in the test data; and $\hat{Y}_i$ represents the predicted
outcome value of the $i$th observation from the model (which is generated using
the training data). When we calculate the MSE of four linear regression models each
fit using one of the four coding strategies mentioned previously, we find that all models 
have the exact same MSE of 0.13674. This illustrates that prediction accuracy is not 
affected by coding strategy when using linear regression. 
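The invariance can be reproduced on simulated data. The sketch below is a minimal illustration in Python with NumPy; it uses a made-up stand-in for Education and Stress (not the COVID-19 Stress Data, so the MSE reported above is not expected to reappear), an 80/20 split, and ordinary least squares via `numpy.linalg.lstsq`. The helper `coding_matrices` and the simulated category means are assumptions introduced only for this example.

```python
import numpy as np

def coding_matrices(k):
    """Coding matrices laid out as in Tables 1-4 (rows = categories, columns = indicators)."""
    helmert = np.zeros((k, k - 1))
    for j in range(k - 1):
        helmert[j, j] = -(k - j - 1) / (k - j)
        helmert[j + 1:, j] = 1.0 / (k - j)
    return {
        "dummy": np.vstack([np.zeros((1, k - 1)), np.eye(k - 1)]),
        "contrast": np.vstack([np.eye(k - 1), -np.ones((1, k - 1))]),
        "sequential": np.tril(np.ones((k, k - 1)), k=-1),
        "helmert": helmert,
    }

rng = np.random.default_rng(1)
k, n = 7, 5000
cat = rng.integers(0, k, size=n)                       # simulated category labels 0..6
sim_means = np.array([2.85, 2.88, 2.96, 3.00, 3.01, 2.99, 2.99])  # made-up values
y = sim_means[cat] + rng.normal(0.0, 0.37, size=n)     # simulated outcome

perm = rng.permutation(n)
train, test = perm[: int(0.8 * n)], perm[int(0.8 * n):]
train_means = np.array([y[train][cat[train] == c].mean() for c in range(k)])

results = {}
for name, C in coding_matrices(k).items():
    X_train = np.column_stack([np.ones(train.size), C[cat[train]]])
    beta, *_ = np.linalg.lstsq(X_train, y[train], rcond=None)
    category_scores = np.column_stack([np.ones(k), C]) @ beta   # one prediction per category
    X_test = np.column_stack([np.ones(test.size), C[cat[test]]])
    mse = np.mean((y[test] - X_test @ beta) ** 2)                # Equation 3 on the test data
    results[name] = (category_scores, mse)
    print(f"{name:10s} test MSE = {mse:.5f}")
```

Under every coding the fitted category scores coincide with the training-set category means, so the test MSE is the same up to floating-point error.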

Although these results may seem trivial, requiring only a basic understanding of linear
regression, they stand in stark contrast to the corresponding results for the
regularized regression approaches examined below. In summary, linear regression models
with different coding strategies predict the same scores (i.e., category means) and give 
the same prediction accuracy, though they produce different coefficients. These proper- 
ties persist when there are additional predictors (categorical and/or continuous) in the 
model, where the predicted scores (which are now adjusted means) are the same for all 
coding strategies, and thus prediction accuracy is always the same as well. 


1.2 Lasso and Group Lasso Regression 


In contrast to linear regression, lasso regression is useful when the proposed model 
involves many predictors, but only a few may be true predictors of the outcome (i.e., 
sparsity). Lasso is gaining popularity in behavioral science presumably because it shares 
many properties with linear regression, an already common statistical approach in the 
field (McNeish, 2015). For example, a lasso model fit to the COVID-19 data using 
Education to predict Stress would share the same equation as linear regression given 
in Equation 1. However, the values of the $\beta_j$ coefficients would differ between the
two methods because linear and lasso regression differ in the way they estimate the
vector containing these regression coefficients, $\boldsymbol{\beta}$. In linear regression, the estimated
coefficient vector is calculated as follows,


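$$\hat{\boldsymbol{\beta}}_{\text{linear}} = \underset{\boldsymbol{\beta}}{\arg\min} \left( \lVert Y - X\boldsymbol{\beta} \rVert_2^2 \right), \tag{4}$$


where $\lVert \cdot \rVert_2$ is the notation for the L2 norm. Lasso, on the other hand, adds a penalty term
governed by the penalty parameter $\lambda$ to regulate the size of the coefficients:


$$\hat{\boldsymbol{\beta}}_{\text{lasso}} = \underset{\boldsymbol{\beta}}{\arg\min} \left( \lVert Y - X\boldsymbol{\beta} \rVert_2^2 + \lambda \lVert \boldsymbol{\beta} \rVert_1 \right), \tag{5}$$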


where $\lVert \cdot \rVert_1$ is the notation for the L1 norm.¹ When $\lambda$ is nonzero, nonzero values of
$\boldsymbol{\beta}$ result in increases in $\lambda \lVert \boldsymbol{\beta} \rVert_1$, and so Equation 5 reaches its minimum when both the
prediction error and the size of the elements of $\boldsymbol{\beta}$ are considered. A large $\lambda$ value results
in the coefficients in $\boldsymbol{\beta}$ being shrunk toward or equal to zero so fewer predictor variables
are selected in the model (where “selected” means that the coefficient is nonzero in the
final solution). A small $\lambda$ value, on the other hand, results in less shrinkage so more
predictor variables can be selected into the model. Linear regression is actually a special
case of lasso regression when $\lambda$ is set to zero.


¹ Another alternative to lasso is ridge regression, which is expressed by Equation 5 except with
an L2 norm instead of an L1 norm for the regularization term. In Equation 5, the L1 norm

While lasso has many benefits over linear regression (Hastie & Tibshirani, 2018; 
McNeish, 2015; Tibshirani, 1996), when applying lasso regression to models with categorical
predictors, additional considerations must be made. Lasso regression models
select variables based on the penalty parameter $\lambda$ and the sizes of the entries in the coefficient
vector $\boldsymbol{\beta}$. However, as we demonstrated with linear regression, using different
coding strategies for a categorical predictor creates models with different coefficient 
vectors. This means that the choice of coding strategy may result in different variable 
selection in lasso regression models. The issue of coding strategies is related to the 
issue of variable scaling with continuous predictors, which also influences variable se- 
lection and prediction accuracy in lasso regression models. One common solution to 
this problem is to standardize all continuous predictors before applying lasso regres- 
sion (Marquardt, 1980). In this way, the effect of scaling is excluded from the variable 
selection of lasso regression with continuous predictors. While dichotomous variables 
can be standardized, different coding strategies representing more than two categories 
do not result in the same standardized solution. Given this, there is reason to believe 
that the performance of lasso regression with categorical variables may be impacted by 
the choice of coding strategies for those variables. 
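This sensitivity can be illustrated on simulated data. The sketch below is a minimal Python/NumPy implementation of lasso by cyclic coordinate descent for the objective in Equation 5, with an unpenalized intercept; the solver, the helper names, the simulated category means, and the choice of penalty are all assumptions made for this illustration and are not the software or settings used in the paper. It fits the same single categorical predictor under the four codings of Tables 1-4 with one shared penalty value.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=500):
    """Coordinate descent for ||y - b0 - X b||_2^2 + lam * ||b||_1 (intercept unpenalized)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    col_ss = (Xc ** 2).sum(axis=0)
    b = np.zeros(X.shape[1])
    resid = yc.copy()
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            resid += Xc[:, j] * b[j]                     # remove predictor j from the fit
            b[j] = soft_threshold(Xc[:, j] @ resid, lam / 2.0) / col_ss[j]
            resid -= Xc[:, j] * b[j]                     # add the updated predictor back
    return y_mean - x_mean @ b, b

def coding_matrices(k):
    helmert = np.zeros((k, k - 1))
    for j in range(k - 1):
        helmert[j, j] = -(k - j - 1) / (k - j)
        helmert[j + 1:, j] = 1.0 / (k - j)
    return {
        "dummy": np.vstack([np.zeros((1, k - 1)), np.eye(k - 1)]),
        "contrast": np.vstack([np.eye(k - 1), -np.ones((1, k - 1))]),
        "sequential": np.tril(np.ones((k, k - 1)), k=-1),
        "helmert": helmert,
    }

rng = np.random.default_rng(2)
k, n = 7, 1400
cat = rng.integers(0, k, size=n)
y = np.array([2.0, 2.5, 3.0, 3.5, 3.0, 2.5, 2.0])[cat] + rng.normal(0.0, 0.5, size=n)  # made-up means
codings = coding_matrices(k)

def lam_max(C):
    """Smallest penalty at which every slope is zero under coding C."""
    Xc = C[cat] - C[cat].mean(axis=0)
    return 2.0 * np.max(np.abs(Xc.T @ (y - y.mean())))

lam = 0.25 * min(lam_max(C) for C in codings.values())   # shared, moderate penalty

def predicted_scores(C, penalty):
    b0, b = lasso_cd(C[cat], y, penalty)
    return b0 + C @ b                                      # one predicted score per category

scores_ols = {name: predicted_scores(C, 0.0) for name, C in codings.items()}
scores_lasso = {name: predicted_scores(C, lam) for name, C in codings.items()}
for name in codings:
    print(name, np.round(scores_lasso[name], 3))
```

With the penalty set to zero the four codings return the same category scores (the category means), whereas with a shared positive penalty the predicted scores depend on the coding.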

A generalization of lasso regression which may also be impacted by coding strategy,
but in different ways, is group lasso regression. Group lasso, as opposed to lasso, performs
variable selection by selecting groups of variables rather than individual variables
(Yuan & Lin, 2006). This is particularly valuable for the case of categorical predictors
because the set of indicators for each variable forms a natural group. The mathematical
formula for estimating the coefficient vector $\boldsymbol{\beta}$ in group lasso is


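$$\hat{\boldsymbol{\beta}}_{\text{group}} = \underset{\boldsymbol{\beta}}{\arg\min} \left( \lVert Y - X\boldsymbol{\beta} \rVert_2^2 + \lambda \sum_{g=1}^{G} \lVert \boldsymbol{\beta}_{I_g} \rVert_2 \right), \tag{6}$$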


where $G$ represents the number of groups of variables, and $\boldsymbol{\beta}_{I_g}$ represents the coefficient
vector of the corresponding group $g$. Other notation is the same as Equation 5. Using
the L2 norm within each group $g$ is what allows group lasso to either select all or
none of the variables within each group. Also, multiplying by $\lambda$ after summing the L2
norms of all groups penalizes each group instead of each individual indicator variable.
These differences provide group lasso with distinct properties: When all variables are 
considered one group, group lasso performs as ridge regression. On the other hand, 
when all the variables are their own group, group lasso performs as lasso regression. 


penalizes the absolute value of the coefficients, used by lasso; while in ridge regression, the 
L2 norm penalizes the squares of all coefficients. Given this property, ridge regression is not 
as effective at penalizing parameters to zero compared to lasso regression (Tibshirani, 1996). 
Therefore, lasso regression is preferred for variable selection. 
The advantage of group lasso is that when there are multiple groups of more than one 
variable, the result is a combination of within-group ridge regression and across-group 
lasso regression. 
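The all-or-none behavior within groups can be seen in a small example. The sketch below minimizes the objective of Equation 6 by proximal gradient descent with block soft-thresholding, written in Python with NumPy. This solver is our own choice for illustration, not the algorithm or software used in the paper; the function name, the simulated data, and the penalty value are likewise assumptions, and the intercept is left unpenalized.

```python
import numpy as np

def group_lasso_pg(X, y, groups, lam, n_iter=500):
    """Proximal gradient for ||y - b0 - X b||_2^2 + lam * sum_g ||b_g||_2."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    # Step size 1/L, where L = 2 * (largest singular value)^2 bounds the gradient's slope.
    step = 1.0 / (2.0 * np.linalg.norm(Xc, 2) ** 2)
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        z = b + step * 2.0 * (Xc.T @ (yc - Xc @ b))   # gradient step on the squared error
        for idx in groups:                            # block soft-thresholding per group
            norm = np.linalg.norm(z[idx])
            scale = max(0.0, 1.0 - step * lam / norm) if norm > 0 else 0.0
            z[idx] = scale * z[idx]
        b = z
    return y_mean - x_mean @ b, b

rng = np.random.default_rng(3)
n = 400
X = rng.normal(size=(n, 8))                           # simulated predictors
groups = [np.array([0, 1, 2]), np.array([3, 4, 5]), np.array([6, 7])]
beta_true = np.array([1.0, -1.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0])  # only the first group matters
y = 3.0 + X @ beta_true + rng.normal(0.0, 0.5, size=n)

b0, b = group_lasso_pg(X, y, groups, lam=500.0)
selected = [bool(np.any(b[idx] != 0)) for idx in groups]
print("coefficients:", np.round(b, 3))
print("group selected:", selected)
```

Because the threshold is applied to the norm of each group's coefficient block, a block is either scaled as a whole or set to zero as a whole, which is why the indicators of one categorical predictor enter or leave the model together.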

The group lasso has special properties with respect to variable selection. Within a 
group, group lasso typically includes or excludes all variables because of the within- 
group ridge regression. Given its unique properties with respect to variable selection, 
group lasso has been recommended as a useful alternative to lasso regression when 
dealing with models with categorical variables (Detmer, Cebral, & Slawski, 2020; Mc- 
Neish, 2015); however, no prior research has explored the sensitivity of group lasso to 
different coding strategies. In group lasso, all indicators for a categorical variable are 
defined as a group, and the algorithm should either include all indicators associated 
with one categorical predictor or exclude all these indicators. 


1.3 Motivation 


With the increasing use of lasso techniques across scientific fields, but especially 
within the social and behavioral sciences, many researchers rely on their intuitions 
about the similarities between lasso and linear regression to understand, use, and in- 
terpret the results of lasso regression. This could be particularly problematic for models 
with categorical predictors. Prediction accuracy in linear regression is unaffected by the 
selection of coding strategy; however, lasso regression conducts regularization by penalizing
the size of the regression coefficients, which differ across coding strategies. This may lead to
different prediction accuracy and variable selection depending on the coding strategy 
used when using lasso. Since group lasso treats the variables in a group as a whole set, 
it seems less likely that its variable selection will be impacted by the choice of coding 
strategy. However, the prediction accuracy of group lasso may still be impacted by the 
coding strategy. 

To explore the potential impacts of coding strategy on important characteristics of 
lasso and group lasso regression, we combine both real data analysis and simulation. 
First, using the COVID stress data set described previously, we demonstrate the use 
of lasso and group lasso regression with categorical variables, where different coding 
strategies of categorical variables impact two aspects of model performance: variable 
selection and prediction accuracy. Next, we use a Monte Carlo simulation to demon- 
strate a specific case where group lasso may tend to overfit the training data. In the last 
section, we explore other potential solutions, important future directions, and general 
conclusions. 


2 Real Data Analysis with COVID Stress Data 


We used the COVID stress data set with the same sample of 10,000 participants 
and the same training/test data sets used in Section 1.1 to explore how coding strategies 
affect models estimated by lasso and group lasso. In the models, we included six categorical
predictors (where a predictor with k categories was represented by k - 1 indicator
variables): Education (7 categories), Employment status (6 categories), Gender (3
categories), Isolation status (4 categories), Marital status (4 categories), and Mother’s
education (7 categories). We also included seven continuous predictors in the models.
Thus, after coding all categorical variables and adding the seven continuous variables, 
the models predicted the outcome Stress with 6+5+2+3+3+6+7 = 32 variables. 
In total, we trained eight different models using lasso and group lasso with four coding 
strategies: dummy, contrast, sequential, and Helmert. We used 10-fold cross-validation 
on the training data to select the penalty parameter from the model with the best predic- 
tion accuracy, so the penalty parameter that was selected is different across models with 
different coding strategies.² We then examined if the variable selection and prediction 
accuracy of these lasso and group lasso models were affected by the choice of coding 
strategy. 
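The penalty-selection step can be sketched as follows. This is a minimal Python/NumPy illustration of 10-fold cross-validation over a small grid of penalty values, using a simple coordinate-descent lasso for Equation 5; the solver, the grid, the simulated data, and the function names are assumptions made for the example and do not reproduce the software or settings of the original analysis.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=100):
    """Coordinate descent for ||y - b0 - X b||_2^2 + lam * ||b||_1 (intercept unpenalized)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    col_ss = (Xc ** 2).sum(axis=0)
    b = np.zeros(X.shape[1])
    resid = yc.copy()
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            resid += Xc[:, j] * b[j]
            b[j] = soft_threshold(Xc[:, j] @ resid, lam / 2.0) / col_ss[j]
            resid -= Xc[:, j] * b[j]
    return y_mean - x_mean @ b, b

def cv_select_lambda(X, y, lambdas, n_folds=10, seed=0):
    """Return the penalty with the lowest mean held-out MSE, plus the MSE for each penalty."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    cv_mse = []
    for lam in lambdas:
        fold_mse = []
        for held_out in folds:
            train = np.setdiff1d(np.arange(len(y)), held_out)
            b0, b = lasso_cd(X[train], y[train], lam)
            pred = b0 + X[held_out] @ b
            fold_mse.append(np.mean((y[held_out] - pred) ** 2))
        cv_mse.append(np.mean(fold_mse))
    cv_mse = np.array(cv_mse)
    return lambdas[int(np.argmin(cv_mse))], cv_mse

rng = np.random.default_rng(4)
n, p = 300, 10
X = rng.normal(size=(n, p))                               # simulated predictors
beta_true = np.array([1.0, -0.7, 0.5] + [0.0] * (p - 3))  # three true predictors
y = 3.0 + X @ beta_true + rng.normal(0.0, 1.0, size=n)

lambdas = np.array([0.0, 5.0, 20.0, 80.0, 320.0, 1e5])    # illustrative grid
best_lam, cv_mse = cv_select_lambda(X, y, lambdas)
print("CV MSE per penalty:", np.round(cv_mse, 3))
print("selected penalty:", best_lam)
```

Because the design matrix changes with the coding strategy, running this procedure separately for each coding can select a different penalty for each model.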


2.1 Variable Selection 


We first examined differences in the variable selections of the four lasso models. 
Results are shown in Table 6. Focusing on the Education variable, we illustrate how 
the use of different coding strategies can result in conflicting findings. Both the dummy 
coding model and the sequential coding model have a predictor which represents the 
difference between no education and 6 years of education. After applying lasso, the 
dummy coding model includes this predictor, whereas the sequential coding model 
excludes this predictor. Based on these results, using the dummy coded model, a re- 
searcher might conclude that COVID stress differs across the no education and 6 years 
of education groups, whereas using a sequential coded model, the opposite conclusion 
would be made. 

Fitting similar dummy-, contrast-, sequential-, and Helmert-coded models with group 
lasso, we found that the results differed notably from the traditional lasso. While lasso’s 
variable selection was affected by the choice of coding strategy (see Table 6), the group 
lasso’s variable selection seemed stable across different coding strategies, with all pre- 
dictor variables selected to remain in all four models. Thus, based on the applied data 
analysis, it seems that variable selection is not impacted by the coding strategy for 
group lasso, though this should be subject to additional investigation. This suggests that 
if researchers are interested in using lasso for variable selection and have categorical 
predictors, using group lasso could avoid the arbitrary choice of coding strategy. How- 
ever, group lasso was not successful in reducing the set of potential predictors, and thus, 
it may suffer from a limitation of being overly inclusive. We explore this issue more in 
a simulation. 


2.2 Prediction Accuracy 


In this section, we investigate whether prediction accuracy is affected by the choice 
of coding strategy using both lasso and group lasso. We examined the prediction accu- 
racy in two ways: predicted category scores and MSE of the model applied to the test 
data set. 


² Note that even with the same penalty parameter, models with different coding strategies or 
reference categories will still have different variable selection and prediction accuracy. 
Table 6: Variable Selection for Different Coding Strategies by Lasso


Lasso Regression

Education
  Dummy:      6 years - no; 9 years - no; 12 years - no; some college - no; PhD - no
  Contrast:   no - Average; 6 years - Average; 9 years - Average; 12 years - Average;
              some college - Average
  Sequential: 9 years - 6 years; 12 years - 9 years; some college - 12 years;
              college - some college; PhD - college
  Helmert:    no - Average(6 years and more); 6 years - Average(9 years and more);
              9 years - Average(12 years and more);
              12 years - Average(some college, college, PhD);
              some college - Average(college, PhD); college - PhD

Employment Status
  Dummy:      part-time - no; student - no; full-time - no; retired - no
  Contrast:   no - Average; self-employed - Average; student - Average; full-time - Average
  Sequential: part-time - no; self-employed - part-time; student - self-employed;
              full-time - student; retired - full-time
  Helmert:    no - Average(part-time, self-employed, student, full-time, retired);
              self-employed - Average(student, full-time, retired);
              student - Average(full-time, retired); full-time - retired

Gender
  Dummy:      man - woman; other - woman
  Contrast:   woman - Average; man - Average
  Sequential: man - woman; other - man
  Helmert:    woman - Average(man, other); man - other

Isolation Status
  Dummy:      minor changes - usual; isolated - usual; medical isolated - usual
  Contrast:   usual - Average; isolated - Average
  Sequential: minor changes - usual; isolated - minor changes; medical isolated - isolated
  Helmert:    usual - Average(minor changes, isolated, medical isolated);
              minor changes - Average(isolated, medical isolated); isolated - medical isolated

Marital Status
  Dummy:      divorced - single; married - single; other - single
  Contrast:   single - Average; divorced - Average; married - Average
  Sequential: divorced - single; married - divorced; other - married
  Helmert:    single - Average(divorced, married, other); divorced - Average(married, other);
              married - other

Mom’s Education
  Dummy:      6 years - no; 12 years - no; some college - no; college - no
  Contrast:   no - Average; 6 years - Average; 9 years - Average; 12 years - Average;
              some college - Average; college - Average
  Sequential: 6 years - no; 9 years - 6 years; 12 years - 9 years; some college - 12 years;
              college - some college; PhD - college
  Helmert:    no - Average(6 years and more); 6 years - Average(9 years and more);
              9 years - Average(12 years and more);
              12 years - Average(some college, college, PhD);
              some college - Average(college, PhD); college - PhD


Note. Variables with a white background color were selected to be in the model, and variables with a grey background color were not selected.
Predicted Category Scores. We first examined whether the predicted stress score for
each Education group is the same with different coding strategies in lasso and group
lasso models. In this section, we generated the predicted score for each category using a
model with only Education as a predictor, so the models contained 6 indicator variables 
in total. While this model is oversimplified, it eases the direct comparison between the 
true means of each group and the predicted scores. 

The predicted category scores for lasso models fit using the four different coding 
strategies are shown in Table 7, with the final column providing the actual category 
means for Education observed in the training data. First, it is important to note 
that category scores shown in the table were rounded. Thus, some category scores that 
were very close to the actual category scores were rounded to the same value, but there 
were no lasso models where the predicted scores were exactly equal to the group means 
like they would have been in linear regression. Also, it is evident in the table that the 
predicted means often differ depending on the coding strategy used. For five of the 
seven categories, the dummy-coded model estimated the category mean most accurately 
among all models. 


Table 7: Predicted Category Scores for Different Coding Strategies by Lasso 


Dummy Contrast Sequential Helmert Training Mean Test Mean 


None 2.864 2.864 2.872 2.852 2.912 
6 years 2.897 2.888 2.896 2.883 2.824 
9 years 2.962 2.965 2.973 2.962 2.857 
12 years 2.997 2.995 2.997 2.997 3.038 
Some college 3.013 3.012 3.008 3.013 3.009 
College 2.990 2.990 2.991 2.990 2.999 
PhD/Doctorate 2.991 [illegible] 2.991 2.991 2.991 3.008 


Note. Rows represent Education categories, and the middle four columns give the model 
predicted values with different coding strategies. The last two columns give the actual mean of 
each category observed in the training and test data, respectively. The closest value to the 
training mean is bolded and the closest value to the test mean has a grey background color in 
each row. 


The results of the predicted category scores for the four group lasso models, shown 
in Table 8, are very similar to the lasso models: Group lasso estimated each category 
score within a categorical variable differently depending on the coding strategy used. 
Thus, although variable selection is not impacted by the coding strategy used for group 
lasso, the predicted category score is impacted by the choice of coding strategy. Also, 
among all group lasso models, the dummy-coded model generated the most accurate 
category scores for four of the seven categories. Thus, regardless of whether lasso or 
group lasso was used, the dummy-coded model estimated the majority of the category 
means better than the other three models. It is unclear whether this finding would remain 
true with other data sets, however. 

Table 8: Predicted Category Means for Different Coding Strategies by Group Lasso


Dummy Contrast Sequential Helmert Training Mean Test Mean


None 2.828 2.863 [illegible] 2.868 2.852 2.912
6 years [illegible] 2.892 2.906 2.894 2.883 2.824
9 years 2.965 2.962 [illegible] 2.963 2.962 2.857
12 years [illegible] 2.996 2.994 2.996 2.997 3.038
Some college 3.013 3.012 3.012 3.013 3.009
College 2.990 2.990 2.990 2.990 2.999
PhD/Doctorate [illegible] 2.991 2.991 2.990 2.991 3.008


Note. Same as Table 7.


The results in Tables 7 and 8 show that different coding strategies result in different
predicted category scores. While this is an important finding, it is equally important
to understand why this occurs and whether the degree of difference is predictable and
understandable rather than random variability due to estimation. A core aspect of lasso
and group lasso models is shrinkage: different coding strategies will result in different
model intercepts and coefficients, because the degree of shrinkage is different across
coding strategies.

To visualize the shrinkage effect of each coding strategy, we plotted the predicted 
scores from each lasso model along with each model’s intercept in Figure 1. In the 
dummy-coded model, the predicted scores are all shrunk slightly toward the no edu- 
cation category score (since it is the intercept in this model) relative to the contrast- 
coded model, where the scores are instead all pulled closer to the grand mean (i.e., the 
model’s intercept). The predicted scores from the sequential-coded and Helmert-coded 
models, on the other hand, are shrunk closer towards each other more than those from 
the dummy-coded or contrast-coded models, reflecting the fact that shrinkage in se- 
quential coding and Helmert coding relies not on the intercept, but on the differences 
between neighboring categories or the average of multiple neighboring categories. For 
example, the 9 years and the some college categories are shrunk closer to the college 
category or PhD/Doctorate category in sequential-coded and Helmert-coded models. In
summary, models fit with different coding strategies have different shrinkage patterns, 
and so predicted scores differ across these models, leading to different prediction accu- 
racy. These results suggest that one way to select a coding strategy is to consider the 
pattern of shrinkage which seems most reasonable. 
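
The shrinkage targets described above can be written out explicitly. The following is a sketch that assumes the conventional definitions of three of the coding schemes (dummy coding with category 1 as the reference, contrast coding as sum-to-zero coding with category k omitted, and sequential coding in which each coefficient is the step from the previous category), which is consistent with the intercept interpretations given for Figure 1. The subscripts and the choice of which category is omitted are illustrative assumptions, and all other predictors are held fixed.

```latex
% k categories; \hat{\mu}_j is the predicted score for category j
\begin{align*}
\text{Dummy:} \quad & \hat{\mu}_1 = \hat{\beta}_0, \qquad
  \hat{\mu}_j = \hat{\beta}_0 + \hat{\beta}_j \quad (j = 2, \dots, k) \\
\text{Contrast:} \quad & \hat{\mu}_j = \hat{\beta}_0 + \hat{\beta}_j \quad (j = 1, \dots, k-1), \qquad
  \hat{\mu}_k = \hat{\beta}_0 - \sum_{j=1}^{k-1} \hat{\beta}_j \\
\text{Sequential:} \quad & \hat{\mu}_1 = \hat{\beta}_0, \qquad
  \hat{\mu}_j = \hat{\mu}_{j-1} + \hat{\beta}_j \quad (j = 2, \dots, k)
\end{align*}
```

Under these assumptions, shrinking a coefficient to zero pulls the corresponding category toward a different target in each scheme: the reference category under dummy coding, the average of the fitted category means under contrast coding, and the preceding category under sequential coding.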


Model Fit Next, we recorded the MSEs obtained by applying models including all six
categorical variables and all seven continuous variables to the test data set (Table 9). Model
fit (MSE) differs by coding strategy for both lasso and group lasso. Contrast-coded
models yielded the lowest MSE for both lasso and group lasso regression. This exposes
uncertainty regarding which coding strategy should be used when lasso or group lasso
regression is applied. While some differences in MSE are expected due to the stochastic
nature of procedures like cross-validation used to choose the penalty parameter (λ),
it is notable that the MSEs were more variable for the group lasso models than they
were for the lasso models, suggesting that choice of coding strategy could result in
a much less optimal model (possibly worse than linear regression) when using group
lasso. We explore this issue more in the Monte Carlo simulation. In Appendix A, we
demonstrate similar issues with the choice of reference group or category order across
different coding strategies, and in Appendix B we demonstrate that the use of singular
design matrices (e.g., including dummy codes for all categories) does not ameliorate
this issue.


Figure 1: Graphical Presentation of Category Means for Education Recreated by Lasso
Models with Different Coding Strategies. Intercept values are different across coding
strategies. The intercept value is the estimated category mean for no education in
dummy and sequential coding and the average of the category means in contrast and
Helmert coding.


Table 9: Model Fit (MSE) for Different Coding Strategies by Lasso and Group Lasso
Regression


Coding strategies         Dummy     Contrast   Sequential   Helmert
Lasso Regression          0.13669   0.13660    0.13675      0.13677
Group Lasso Regression    0.13711   0.13689    0.13691      0.13695


Note. Rows represent different lasso methods, and columns represent models with different 
coding strategies. The lowest value (best prediction) in each row is bolded. 


2.3 Summary 


Choice of coding strategy has the potential to affect both variable selection and 
prediction accuracy in lasso regression models. As a result, depending on the coding 
strategy used, an analyst may end up with different variables included in their model, 
different predicted scores, and different prediction accuracy. With both the model’s vari- 
able selection and predictive performance dependent on how categorical predictors are 
represented in the model, it is not a choice that should be taken lightly. Ideally, there
would be a method which provides the same variable selection and the same predicted 
scores regardless of the coding strategy chosen. 

Group lasso partly addresses the issues caused by the choice of different coding 
strategies in lasso regression, because group lasso’s variable selection is not affected 
by the coding strategy used. Therefore, if researchers use group lasso to select which 
variables contribute to the outcome variable, they do not need to worry that different 
coding strategies may result in different conclusions. However, coding strategies still 
affect the prediction accuracy of group lasso models. Therefore, if researchers aim to 
predict the outcome variable by using group lasso regression, they need to be aware 
that different coding strategies can result in different prediction accuracy. In addition, 
because group lasso is selecting more variables into the model, the robustness of group 
lasso across coding strategies may come at the cost of prediction accuracy. Comparing 
the MSEs between the lasso models and group lasso models, the lasso models typically 
have lower MSE (i.e., better prediction accuracy) than group lasso. 

This trade-off between prediction accuracy and robustness leads to some additional
concerns about the group lasso. There seems to be a trade-off between including an
entire set of predictors in a model and including only a specific predictor. For example, if
the average stress for all levels of education was the same except for those with PhDs, 
would group lasso still select the education set of variables into the model? Will the set 
of indicators for the categorical variables be selected if there is only one category that 
differs from the other categories within that variable? If this group is selected into the 
model, this means that many additional parameters would also be included to capture an 
effect that is only attributable to one indicator variable. Alternatively, if the group is not 
selected, then the predictive ability of the group lasso model may suffer. This problem 
does not occur with lasso, as it is able to include a single indicator variable to represent 
one category differing from the rest. Next, we explore this specific case and examine if 
group lasso’s ability to include groups of variables leads to issues with overfitting. 
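
To make the all-or-nothing behavior concrete, the following is a minimal sketch (not the analysis code used in this paper) of the two shrinkage rules in the simplest textbook setting, an orthonormal design, where lasso soft-thresholds each coefficient separately and group lasso scales a whole block of coefficients by one common factor. The input values and the penalty are hypothetical, and actual software typically iterates such updates and may weight the group penalty by group size.

```python
import math


def soft_threshold(z, lam):
    """Lasso rule: shrink each coefficient toward zero separately."""
    out = []
    for v in z:
        magnitude = max(abs(v) - lam, 0.0)
        if magnitude == 0.0:
            out.append(0.0)
        else:
            out.append(magnitude if v > 0 else -magnitude)
    return out


def group_soft_threshold(z, lam):
    """Group lasso rule: shrink the whole block by one common factor."""
    norm = math.sqrt(sum(v * v for v in z))
    if norm <= lam:
        return [0.0] * len(z)  # the entire group is dropped
    scale = 1.0 - lam / norm
    return [scale * v for v in z]


# Hypothetical unpenalized indicator coefficients: one dominant category
# (0.30) and two categories that differ from the reference only by noise.
z = [0.30, 0.02, -0.01]
lam = 0.10  # hypothetical penalty value

lasso_coefs = soft_threshold(z, lam)
group_coefs = group_soft_threshold(z, lam)
print(lasso_coefs)
print(group_coefs)
```

In this toy case lasso keeps only the dominant indicator, whereas group lasso keeps all three indicators (each shrunk by the same factor), which is the pattern examined in the simulation that follows.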


3 Monte Carlo Simulation 


In this section, we use a Monte Carlo simulation to explore a potential weakness of 
group lasso: overfitting. Group lasso may select more variables than necessary into the 
model, leading to larger variance and lower prediction accuracy. We explore a partic- 
ularly extreme data generation case, where across all categories within one categorical 
variable, only one category differs from the rest. We call this category the dominant 
category and refer to all others as non-predictive categories. A non-predictive category 
is always used as the reference category in the analysis. While the simulation is much 
simpler than cases that would occur in real data analysis, it provides a clear demonstra- 
tion of a pattern that is likely to occur and be problematic and hard to identify in more 
complex situations. 


Simulation Method The data were generated such that the dominant category had a
nonzero category mean, while non-predictive categories all had category means of zero.
All categorical variables were encoded using dummy coding. A second predictor variable
was generated to follow a standard normal distribution. The outcome variable was
created by adding the category mean, the value of the continuous variable, and a random 
error term drawn from a standard normal distribution. For optimal prediction, both the 
continuous predictor and the indicator variable which estimates the difference between 
the dominant category and other non-predictive categories should be included in the 
model, while the variables associated with non-predictive categories should not. 

As previously mentioned, the number of categories within categorical predictors 
may affect how the coefficients are estimated and how the model selects predictors in 
group lasso. Therefore, we varied the number of non-predictive categories (2, 3, 4). To
examine how the effect size would affect group lasso’s prediction accuracy and variable 
selection, we also simulated different dominant category means (0.1, 0.2, 0.3). For each 
combination of number of categories and effect size, we randomly generated 500 data 
sets with a sample size of 1200. 
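
The data-generating process and design described above can be sketched as follows. This is an illustration rather than the simulation code itself: the function name, the labeling of the last category as dominant, and the assumption that categories are equally likely are not stated in the text and are introduced here only for the example.

```python
import random
from itertools import product


def simulate_dataset(n, n_nonpredictive, dominant_mean, seed=None):
    """Generate one data set with a categorical and a continuous predictor."""
    rng = random.Random(seed)
    n_categories = n_nonpredictive + 1
    dominant = n_categories - 1  # hypothetical labeling: last category is dominant
    rows = []
    for _ in range(n):
        category = rng.randrange(n_categories)  # assumption: equally likely categories
        x = rng.gauss(0.0, 1.0)  # continuous predictor, standard normal
        category_mean = dominant_mean if category == dominant else 0.0
        y = category_mean + x + rng.gauss(0.0, 1.0)  # standard normal error
        # Dummy coding with category 0 (a non-predictive category) as the reference.
        dummies = [1.0 if category == j else 0.0 for j in range(1, n_categories)]
        rows.append({"category": category, "dummies": dummies, "x": x, "y": y})
    return rows


# Fully crossed design: number of non-predictive categories by dominant mean.
conditions = list(product([2, 3, 4], [0.1, 0.2, 0.3]))
example = simulate_dataset(n=1200, n_nonpredictive=3, dominant_mean=0.3, seed=1)
```

With 500 replications in each of the nine conditions, the design amounts to 4,500 generated data sets.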

For each data set, we first split the data set into training and test sets randomly based 
on an 8:2 ratio. Then we fit lasso and group lasso models with the same training data. 
We selected the penalty parameter using the same cross-validation methods used in 
previous sections. For each model, we calculated the MSE, whether the model included 
the dominant category, and whether the model included the non-predictive categories. 
We calculated the average prediction accuracy of each method as well as the proportion 
of models that included the dominant category and the proportion that included non- 
predictive categories across each condition. For group lasso, these two proportions were 
always the same because group lasso either includes or excludes all categories within 
the categorical predictor. 
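
The per-data-set bookkeeping described above might look like the sketch below. Model fitting is deliberately left out, since it depends on the lasso and group lasso software used; the helper names are hypothetical, and treating "included the non-predictive categories" as "at least one non-predictive indicator has a nonzero coefficient" is an assumption.

```python
import random


def split_train_test(rows, train_fraction=0.8, seed=None):
    """Randomly split rows into training and test sets (8:2 by default)."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def mean_squared_error(observed, predicted):
    """Average squared difference between observed and predicted outcomes."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


def selection_flags(indicator_coefs, dominant_index):
    """Return (includes dominant category, includes any non-predictive category)."""
    includes_dominant = indicator_coefs[dominant_index] != 0.0
    includes_nonpredictive = any(
        c != 0.0 for i, c in enumerate(indicator_coefs) if i != dominant_index
    )
    return includes_dominant, includes_nonpredictive


train, test = split_train_test(range(1200), seed=1)
```

For a group lasso fit, the two flags returned by `selection_flags` always agree, which is why the two proportions coincide for that method.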


Simulation Results We first found that in all conditions lasso had a higher prediction 
accuracy than group lasso, indicated by lower MSEs (Table 10). Though the differ- 
ences in MSE of lasso and group lasso were small, they were consistent across different 
conditions. Secondly, for both group lasso and lasso regression, when the number of 
non-predictive categories increased, the probability for models to include the dominant 
category decreased, but the probability for lasso was consistently greater than or equal 
to that for group lasso (Figure 3). This means that lasso is at least as likely as group lasso
to include the dominant category, regardless of the number of non-predictive categories. Figure
2 shows that when the number of non-predictive categories stayed the same, the prob- 
ability for group lasso to include non-predictive categories increased when the effect 
size increased, while the probability for lasso remained relatively flat. For both mod- 
els, the probability of including non-predictive categories decreased as the number of 
non-predictive categories increased. 

Returning to the potential issue of overfitting in group lasso, consider the case where 
the dominant group mean is large. Figure 2 shows that when the dominant group mean 
was 0.3, group lasso had a higher probability than lasso of including non-predictive 
categories. In this case, group lasso could overfit the data because group lasso was more 
likely to include categories that were not supposed to be in the model. This also explains 
group lasso’s lower prediction accuracy than lasso in Table 10 when the dominant group 
mean was large. 


Table 10: Differences in MSE of Lasso and Group Lasso Models for Monte Carlo Simulation


Number of Non-predictive Categories 


Dominant Category Mean 2 3 4 
0.1 0.0028 0.0029 0.0003 
0.2 0.0020 0.004 0.0008 
0.3 0.0029 0.0029 0.0030 


Note. Values larger than zero mean that the MSE for group lasso is larger than the MSE for 
lasso. 


[Figure 2 panels: Number of Non-predictive Categories = 2, 3, and 4. x-axis: Category Means; y-axis: Probability to Include Non-predictive Categories.]

Figure 2: Comparison of Probabilities of Including Non-Predictive Categories under 
Different Numbers of Categories for Lasso and Group Lasso Models 


Figure 3: Comparison of Probabilities of Including the Dominant Category under Different
Dominant Category Means for Lasso and Group Lasso Models


Simulation Summary Using Monte Carlo simulation, we demonstrated conditions
under which group lasso may be likely to have issues with overfitting. When one or just 
a few categories differ from the rest, lasso may be more efficient with better prediction 
accuracy than group lasso. In these cases, group lasso is likely to include the categorical 
variable, including all non-predictive categories. Therefore, if researchers use group 
lasso to build predictive models, they may want to examine if one or two categories 
have relatively dominant means within categorical variables in advance, or if this pattern 
is hypothesized to occur they might prefer lasso. Looking for these effects may be
particularly difficult in cases with many predictors where limited theoretical knowledge
is driving the modeling, which is often the case when lasso is used. Moreover, the differences
must be assessed conditional on all other variables in the data, not just by examining the
group means. If there are many categorical predictors in the model, exploratory analyses could
be undertaken for each categorical variable, but this could be a tedious undertaking.
Overall, this simulation demonstrates that there may be situations in which group lasso 
is not optimal for handling categorical predictors, especially if prediction accuracy is a 
high priority. 


4 Discussion 


In this paper, we demonstrate that lasso and group lasso models are sensitive to de- 
cisions about coding strategy for categorical predictors (e.g., dummy or sequential) and 
the choice of reference group/order of the categories (Appendix A). Linear regression 
does not have this problem, as the model fit and predicted values do not vary depend- 
ing on the coding strategy. Group lasso presents a partial solution by having consistent 
variable selection across coding strategies. However, this consistency may come at a 
cost of reduced prediction accuracy. Ultimately, this leaves open the question of which 
coding strategy should be chosen. In the next section, we explore potential solutions to 
this issue with categorical predictors in lasso-based models. 


4.1 Exploring Potential Solutions 


Regardless of which of the following solutions researchers choose, one thing is 
always required: transparency. In searching the literature for examples of applications of 
lasso with categorical predictors, we found very few teams reported the coding strategy 
or order of categories used. Researchers using categorical variables in lasso or group 
lasso regression need to report how they coded the variables (both coding strategy and 
variable order/reference group) as this is imperative for reproducing or replicating their 
results. The following are a few proposed solutions, none of which seem satisfactory 
for all cases. As such, we weigh the pros and cons of each and consider cases when 
each approach might be most acceptable. 


Prioritize Interpretability In cases where one coding strategy provides better inter- 
pretability of the model coefficients than another strategy, the most interpretable coding 
strategy could be chosen. This comes at the risk of having a worse predictive model, 
since the idea of interpretability is still very much rooted in the origins of inferential 
rather than predictive statistical models. In particular, because the coefficient estimates
in lasso regression are biased, they should not be interpreted directly. Rather, after variable
selection is completed, common recommendations are to fit a linear regression
model that only includes the selected variables (Hastie, Tibshirani, & Wainwright, 2015).
It would be unusual to include a coding strategy in the follow-up linear regression that is 
different from the strategy used in the lasso regression. Thus, researchers should choose 
the coding strategy for each categorical variable that would be most interpretable if that 
variable was selected by a variable selection procedure to remain in the model. Coding 
schemes like Helmert coding require the presence of all predictors to have the intended 
interpretation, and should perhaps only be used in concert with group lasso (ensuring 
all predictors are selected in or out of the model) if interpretability is the top priority. 
Notably, machine learning approaches are often used in cases where there are many 
variables included in the analysis, and relatively little theory regarding which variables 
should be predicting the outcome. This could make it difficult for the researcher (or an- 
alyst) to decide which coding scheme would be “most interpretable,” especially consid- 
ering the many possible combinations of coding schemes and variable orders/reference 
groups. 


Prioritize Robust Variable Selection Based on the real data analysis and the simula- 
tion results, the group lasso is robust to coding strategy choices with respect to variable 
selection. Prediction accuracy is not necessarily optimized for the group lasso. How- 
ever, when the goal is to select variables, and especially when it is conceptually useful 
to keep or drop all indicators for each categorical variable, group lasso seems to be an 
optimal choice. Nevertheless, this may come at a cost of prediction accuracy, particu- 
larly if categorical variables follow the dominant group pattern explored in the Monte 
Carlo simulation above, where one group is distinct from all other groups. 


Prioritize Prediction Another option when estimating lasso or group lasso models 
would be to try many different coding strategies in order to select the one with the best 
overall prediction accuracy. This process should likely be completed using the train- 
ing data so it does not influence the final prediction accuracy estimate acquired using 
an independent sample of the data. This approach can be very computationally intensive:
with multiple categorical variables in the data set, maximizing prediction accuracy
would require trying many different combinations of coding strategies.
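
The computational burden can be illustrated with a short sketch that enumerates coding-strategy assignments. Everything here is hypothetical scaffolding: the variable names other than Education are placeholders, and `cv_error` stands for a user-supplied function that would fit a model under a given assignment and return its cross-validated error on the training data only.

```python
from itertools import product

STRATEGIES = ("dummy", "contrast", "sequential", "helmert")


def coding_combinations(variables, strategies=STRATEGIES):
    """Yield every assignment of one coding strategy to each categorical variable."""
    for combo in product(strategies, repeat=len(variables)):
        yield dict(zip(variables, combo))


def best_coding(variables, cv_error):
    """Pick the assignment with the lowest error according to cv_error.

    cv_error is a hypothetical callable taking an assignment (a dict mapping
    variable name to strategy) and returning training-data cross-validated error.
    """
    return min(coding_combinations(variables), key=cv_error)


# Six categorical variables, as in the real-data example; names are placeholders.
variables = ["Education"] + [f"categorical_{i}" for i in range(2, 7)]
n_models = sum(1 for _ in coding_combinations(variables))
print(n_models)
```

With four strategies and six categorical variables this already requires 4,096 model fits, before any choice of reference category or category order is considered.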


Notably, if prediction accuracy is of the highest priority, alternative machine learn- 
ing approaches typically have higher prediction accuracy than lasso approaches, and 
many are robust to coding strategy. Techniques like classification and regression trees 
(CART) are unaffected by coding strategy because categorical predictors are treated as 
a single variable (Finch & Schneider, 2007). Realistically, researchers may be balancing 
their comfort with advanced analytic methods and their priority of prediction accuracy. 
CART methods do not provide the “regression-like” estimates which many behavioral
scientists rely on for interpreting their results. 


4.2 Future Directions


There are several future directions we believe would be particularly beneficial for 
improving the state of research in the area of (group) lasso regression with categorical 
predictors. The first is the concept of intercept penalization. The typical practice within 
lasso is not to penalize the intercept (Wu & Lange, 2008), but the interpretation of the 
intercept varies greatly depending on which coding scheme is used. For example, when 
dummy coding is used, the intercept is the average of the reference group. Alternatively, 
when contrast coding is used, the intercept is the average of all groups. Ultimately, 
this means that different group means have differential penalization depending on the 
coding strategy used (as reflected in Figure 1). Thus, it is worth investigating whether 
penalizing the intercept may be appropriate in certain cases, and whether this would im- 
prove prediction accuracy (just as penalizing all other regression coefficients improves 
prediction accuracy in lasso). This question remains largely unexplored and would be 
informative to researchers who are interested in improving prediction accuracy. 

Current defaults in software suggest that the field norm for coding strategy is dummy 
coding. The current research has demonstrated that dummy coding is a potentially risky 
choice as a default, as the choice of reference group can greatly impact the model, and 
the shrinkage is toward a group mean. Alternatively, contrast coding may be an
appealing default for researchers unsure about which coding strategy to use. Because
the interpretation of the intercept for contrast coding is the average across all groups, 
the penalization of the groups is symmetric about this average. This means that when 
a coefficient is dropped from the model, the group that is indicated by this predictor 
is assumed to be equal to the grand mean. This method contrasts with dummy coding 
where all estimated group means are shrunk towards the reference group score. As a 
result, the selection of the reference group in contrast coding has less of an impact on 
parameter estimates than it does in dummy coding, because by selecting a reference 
group in dummy coding, that group’s score is not at all penalized (if the intercept is 
not penalized). The interpretation of the intercept from contrast coding also aligns with 
how intercepts would be interpreted if there were no categorical variables in the model 
and all continuous predictors were standardized (i.e., sample average). Thus, contrast 
coding stands as a reasonable default if researchers are unsure of which coding strategy 
to choose; however, the use of contrast coding should be studied further in a variety of 
contexts to assess its appropriateness as a potential default. 

Another observation our team made during this investigation was that group size 
mattered quite a lot with respect to how much predicted group scores varied across dif- 
ferent coding strategies. In particular, in the COVID stress data, the no education group 
was particularly small (N = 77 out of 10,000 observations). This resulted in two prob- 
lems that merit further investigation. The first is how group size can impact estimates 
and interact with the selection of coding strategy/reference group. Previous research by 
Choi, Park, and Seo (2012) has already shown that variability in the number of groups 
that categorical predictors contain can influence whether lasso or group lasso produces 
better prediction accuracy and recovery of model coefficients. As can be seen in Figure 
1 and Table 7, the estimated means for the no education group in the COVID stress data 
were very unstable and varied more across coding strategies than any other group. Similarly,
in Table 12 in Appendix A, we can see that the estimates of all of the Education
group means have the greatest bias when no education is used as the reference group.
Future research should examine how variability in the sizes of those groups can impact 
the fitting of lasso and group lasso models.

A second issue brought up by having small groups is the difficulty of splitting test 
and training data sets. This may become particularly problematic when there are many 
categorical variables that include many groups. Previous researchers have resolved to 
combine groups that are particularly small (e.g., racial/ethnic minorities; Webb et al., 
2019). It is unclear how this practice impacts estimates for these groups, however, and in 
general combining groups is actively discouraged for other analytic methods (Tarantola 
& Dellaportas, 2005). Methods for splitting the data such as block randomization may 
provide more accurate predictions for small groups if the groups can be evenly split 
across the training and test sets. 
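
One way to split small groups evenly across the training and test sets is to perform the random split separately within each group. The sketch below is an illustration of that idea, not a procedure evaluated in this paper; the function name, the 8:2 ratio, and the two-group example (77 observations in the small group out of 10,000) are assumptions chosen to mirror the numbers mentioned above.

```python
import random
from collections import defaultdict


def stratified_split(groups, train_fraction=0.8, seed=None):
    """Split observation indices so each group keeps roughly the same train share."""
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for index, group in enumerate(groups):
        by_group[group].append(index)
    train, test = [], []
    for indices in by_group.values():
        rng.shuffle(indices)  # randomize within the group
        cut = round(train_fraction * len(indices))
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return train, test


# Hypothetical example: a small group of 77 among 10,000 observations.
groups = ["none"] * 77 + ["other"] * 9923
train, test = stratified_split(groups, seed=1)
n_none_test = sum(groups[i] == "none" for i in test)
```

Splitting within groups guarantees that the small group appears in both sets in a fixed proportion, rather than leaving its representation in the test set to chance.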


4.3 Conclusion 


Overall, our findings suggest that researchers should be cautious and purposeful 
about selecting their coding strategies when using lasso or group lasso. These choices 
will impact both variable selection and prediction accuracy when using lasso and pre- 
diction accuracy when using group lasso. However, just because variable selection is 
not impacted in group lasso does not mean this method should always be preferred. In a 
simulation study, we demonstrated cases where group lasso may have lower prediction 
accuracy than lasso, particularly when there is a dominant group (one group that dif- 
fers from all other groups). The choices of which method to use (lasso or group lasso), 
what coding strategy to use, and which group order/reference category to use should 
depend on the researcher’s priorities. How categorical variables are represented in lasso 
or group lasso models must be transparently reported to maximize reproducibility and 
replicability. Future research should explore specific practices in this area such as pe- 
nalization of the intercept, the use of contrast coding, and how small groups should be 
accounted for to optimize prediction accuracy for these groups. 

Behavioral scientists are quickly adopting useful tools developed in statistics and 
computer science which fit under the broad area of machine learning and artificial intel- 
ligence. The use of these tools will likely improve the ability of behavioral researchers 
to predict out-of-sample data, which may be particularly important in clinical settings 
and precision medicine. However, it is important to acknowledge that these new tools do 
not necessarily perform in the same ways that many researchers expect based on their 
training, which is primarily in linear regression and ANOVA frameworks (Aiken, West, 
& Millsap, 2008). Ensuring that the differences between these more traditional statis- 
tical frameworks and the newly developed machine learning frameworks are clearly 
defined will improve the implementation of these new methods throughout the field of 
behavioral science. 


Appendix A Different reference categories


In addition to the analyses presented in the primary manuscript, we also examined 
how variable selection and prediction accuracy in lasso and group lasso models differ 
across choices within a specific coding strategy. These choices include reference cate- 
gories (dummy and contrast coding) and the order of categories (sequential and Helmert 
coding). We tested whether the category chosen as the reference category in the dummy 
coding strategy matters for variable selection and prediction accuracy. Consider, for ex- 
ample, the dominant group case where all groups have the same mean except one group. 
If that one group is selected as the reference category, then all k − 1 predictors should be
selected into the model, because all other groups are different from the reference. If any 
other group is selected as the reference group, then only 1 predictor should be selected 
into the model (the indicator for the difference between the one deviant group and the 
reference). While the pattern of means is not different, the reference group may have a 
large impact on the size of the coefficients and the number of non-zero coefficients. 
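
The dominant group example can be checked with a few lines of arithmetic. The sketch below uses hypothetical group labels and means, and computes the unpenalized dummy-coded coefficients (each group mean minus the reference mean) rather than lasso estimates.

```python
def dummy_coefficients(group_means, reference):
    """Unpenalized dummy-coded coefficients: each group mean minus the reference mean."""
    ref_mean = group_means[reference]
    return {g: m - ref_mean for g, m in group_means.items() if g != reference}


def n_nonzero(coefs):
    """Number of coefficients that differ from zero."""
    return sum(1 for value in coefs.values() if value != 0.0)


# Hypothetical pattern: group D is the only group that differs from the rest.
means = {"A": 0.0, "B": 0.0, "C": 0.0, "D": 0.3}

print(n_nonzero(dummy_coefficients(means, reference="D")))  # dominant group as reference
print(n_nonzero(dummy_coefficients(means, reference="A")))  # any other group as reference
```

The same pattern of means therefore calls for three nonzero coefficients in one parameterization and only one in the other, so a penalty on the number or size of coefficients treats the two parameterizations differently.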

We fit lasso and group lasso models with all six dummy-coded categorical variables 
and seven continuous variables using the COVID stress data. To explore how choices of 
reference categories affect estimated coefficients, we fit seven models for each regres- 
sion method with differences only in their choices of reference categories in the variable 
Education. The reference categories were chosen and fixed for all other categorical vari- 
ables. Therefore, the differences between these models can only be attributed to the dif- 
ferent choices of the reference category of the variable Education. While this example 
uses dummy coding, we believe the results would generalize to other coding strategies 
(e.g., choice of the reference group for contrast coding, order of groups for Helmert and 
sequential coding). 


Appendix A.1 Variable Selection 


Table 11 shows the coefficients of indicators for Education. The size of the coeffi- 
cients varies depending on which group is the reference, which could pose a problem 
for lasso regression because coefficients and the penalty parameter decide whether the 
variable will be selected into the model, according to Equation 5. Different coefficients 
are not necessarily a problem by themselves; however, these results demonstrate cer- 
tain asymmetries that are concerning. When coefficients vary from model to model, 
the variable selection can differ. For example, when “none” was the reference category, 
the college category was not selected into the model (i.e., the none and college cate- 
gories are assumed to be equal). However, when “college” was chosen as the reference 
category, the none category was selected into the model (i.e., the none and college categories
are treated differently). This marks a particularly concerning lack of symmetry
between these lasso models. 


Table 11: Model Coefficients for Different Reference Categories by Lasso
Reference Category


Variables       None     6 years  9 years  12 years  Some college  College  PhD/Doctorate
Intercept       2.637    2.574    2.649    2.668     2.657         2.641    2.649
None            .        -0.013   -0.076   -0.092    -0.083        -0.068   -0.075
6 years         -0.068   .        -0.079   -0.095    -0.086        -0.071   -0.078
9 years         0.010    0.065    .        -0.006    0             0.009    0.002
12 years        0.033    0.084    0.024    .         0.017         0.032    0.024
Some college    0.017    0.067    0.008    -0.006    .             0.016    0.008
College         0        0.049    -0.008   -0.024    -0.015        .        -0.008
PhD/Doctorate   0.005    0.057    0        -0.015    -0.006        0.005    .


Note. Each column represents one model, and each row represents the coefficients for Education 
produced by each model. “.” is the reference category for the corresponding model, and 0 means
that lasso does not select the corresponding predictor to be included in the model. 


Group lasso models included all categories within the variable Education when dif- 
ferent categories were chosen as the reference categories, meaning that all categories 
were treated as different in all group lasso models. Group lasso ensures stable perfor- 
mance of variable selection across reference categories. 

We also explored the effect of different reference categories in education on other 
predictors and found that choosing different reference categories affects the coefficients 
and variable selection of other predictors (categorical and continuous) in lasso models. 
Group lasso models, on the other hand, still performed consistent variable selection for 
predictors that did not have their reference categories changed. In our case, group lasso 
models always included all categories within the other five categorical predictors and 
all seven continuous predictors. 


Appendix A.2 Prediction Accuracy 


We examined the prediction accuracy from two aspects: predicted category scores 
and model fit, varying the reference group used in dummy coding education. 


Predicted Category Scores Predicted values for each category differed across reference
categories in both lasso and group lasso models (Tables 12 and 13). For the no education
category, lasso models with different reference categories predicted different values,
ranging from 2.915 to 2.982. Group lasso models also predicted different values for the no
education category, ranging from 2.983 to 2.991. This indicates that, for both lasso and
group lasso, predicted values vary from model to model with the choice of reference
category.
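
As a quick check of these ranges, the following sketch recomputes them from the None rows of Tables 12 and 13. The values are transcribed from the tables; the helper name `spread` is ours and purely illustrative.

```python
# Predicted means for the "None" category, one value per reference category,
# transcribed from the None rows of Table 12 (lasso) and Table 13 (group lasso).
none_row_lasso = [2.982, 2.920, 2.915, 2.915, 2.915, 2.915, 2.915]
none_row_group = [2.986, 2.986, 2.986, 2.990, 2.991, 2.983, 2.987]

def spread(values):
    """Return (minimum, maximum, width of the range) rounded to 3 decimals."""
    lo, hi = min(values), max(values)
    return lo, hi, round(hi - lo, 3)

lasso_spread = spread(none_row_lasso)
group_spread = spread(none_row_group)
print("lasso:      ", lasso_spread)
print("group lasso:", group_spread)
```

For this category the group lasso predictions vary over a much narrower interval than the lasso predictions, although neither set is constant across reference categories.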
Table 12: Predicted Category Means and Prediction Accuracy for Different Reference
Categories by Lasso

                                    Reference Category
Category        None     6 years  9 years  12 years  Some college  College  PhD/Doctorate
None            2.982    2.920    2.915    2.915     2.915         2.915    2.915
6 years         2.914    2.933    2.912    2.912     2.912         2.912    2.912
9 years         2.992    2.998    2.990    3.001     2.998         2.992    2.992
12 years        3.015    3.017    3.014    3.006     3.014         3.014    3.014
Some college    2.999    3.001    2.998    3.000     2.998         2.998    2.998
College         2.982    2.983    2.982    2.982     2.982         2.983    2.982
PhD/Doctorate   2.988    2.990    2.990    2.991     2.991         2.987    2.990
MSE             0.13669  0.13678  0.13674  0.13684   0.13675       0.13674  0.13674


Note. Each column represents one model, and each row (besides the last) represents the 
predicted category means for Education produced by each model (with all continuous predictors 
set to their means and all other categorical variables set to their modes). The last row contains 
the MSE of the corresponding model. 


Table 13: Predicted Category Means and Prediction Accuracy for Different Reference
Categories by Group Lasso

                                    Reference Category
Category        None     6 years  9 years  12 years  Some college  College  PhD/Doctorate
None            2.986    2.986    2.986    2.990     2.991         2.983    2.987
6 years         2.971    2.978    2.974    2.983     2.981         2.971    2.975
9 years         2.988    2.987    2.968    2.979     2.975         2.965    2.969
12 years        3.002    3.001    3.004    2.991     2.992         2.985    2.988
Some college    2.996    2.996    2.997    2.996     3.004         3.003    3.004
College         2.983    2.983    2.983    2.983     2.983         2.996    2.997
PhD/Doctorate   2.988    2.988    2.988    2.990     2.990         2.987    2.983
MSE             0.13711  0.13719  0.13709  0.13727   0.13709       0.13708  0.13710


Note. Same as Table 12.


Figure 4 visualizes the shrinkage effect when different reference categories were
chosen in lasso models using Education to predict Stress. Because these models use
dummy coding, the intercept of each model is the predicted category mean of that
model’s reference category. As in Figure 1, the recreated category scores shrink
towards the reference value under dummy coding.
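
One way to see why the shrinkage targets the reference category is to rewrite the lasso objective in terms of predicted category means. The notation below is assumed for this sketch and is not taken from the paper: r is the reference category, c_i is the category of observation i, x_{ig} is the dummy indicator for category g, and the 1/(2n) scaling of the squared-error term is one common convention that may differ across software. With Education as the only predictor, dummy coding gives

```latex
\[
\frac{1}{2n}\sum_{i=1}^{n}\Bigl(y_i-\beta_0-\sum_{g\neq r}\beta_g x_{ig}\Bigr)^{2}
+\lambda\sum_{g\neq r}\lvert\beta_g\rvert
\;=\;
\frac{1}{2n}\sum_{g=1}^{k}\;\sum_{i:\,c_i=g}\bigl(y_i-\mu_g\bigr)^{2}
+\lambda\sum_{g\neq r}\lvert\mu_g-\mu_r\rvert ,
\qquad
\mu_r=\beta_0,\quad \mu_g=\beta_0+\beta_g .
\]
```

Under this rewriting, the penalty acts only on the distances between each non-reference mean and the reference mean, so changing r changes which differences are penalized and therefore changes the solution.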


Model Fit Model fit, measured by MSE, for both lasso and group lasso models is
shown in Tables 12 and 13. MSEs generally differed across reference categories.
Note that the MSEs in Tables 12 and 13 are rounded. Although some MSEs were very
close to each other and round to the same value, they were not exactly the same,
as they would be if linear regression were used.
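
The contrast with linear regression can be illustrated on simulated data. The sketch below is not the analysis reported here: the data are synthetic, the penalty value `lam=0.1` is arbitrary, and the lasso is fit with a small hand-written coordinate descent routine (objective (1/2n)||y - b0 - Xb||^2 + lam * ||b||_1, intercept unpenalized) rather than with the software used for the tables. All function names are ours.

```python
import numpy as np

def dummy_design(groups, k, ref):
    """Dummy-code integer group labels, dropping category `ref` as the reference."""
    keep = np.array([g for g in range(k) if g != ref])
    return (groups[:, None] == keep[None, :]).astype(float)

def ols_mse(X, y):
    """Training MSE of an ordinary least-squares fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.mean((y - A @ beta) ** 2))

def lasso_mse(X, y, lam, n_iter=2000):
    """Training MSE of a lasso fit obtained by cyclic coordinate descent."""
    n, p = X.shape
    b0, b = float(y.mean()), np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            partial = y - b0 - X @ b + X[:, j] * b[j]  # residual ignoring column j
            rho = X[:, j] @ partial / n
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]  # soft threshold
        b0 = float(np.mean(y - X @ b))  # unpenalized intercept update
    return float(np.mean((y - b0 - X @ b) ** 2))

# Synthetic data: four categories with hypothetical true means 0, 0, 1, 3.
rng = np.random.default_rng(0)
k = 4
groups = np.repeat(np.arange(k), 50)
y = np.array([0.0, 0.0, 1.0, 3.0])[groups] + rng.normal(scale=0.5, size=groups.size)

ols = [ols_mse(dummy_design(groups, k, ref), y) for ref in range(k)]
lasso = [lasso_mse(dummy_design(groups, k, ref), y, lam=0.1) for ref in range(k)]
print("OLS MSE by reference:  ", np.round(ols, 6))
print("lasso MSE by reference:", np.round(lasso, 6))
```

In this toy setting the least-squares MSE is identical for every reference category up to floating-point error, while the lasso MSE changes with the reference category at a fixed penalty.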
Figure 4: Graphical Presentation of Category Means for Education Recreated by Lasso
Models with Different Reference Categories. Intercept values are different across refer- 
ence categories. In dummy coding, the intercept value is the estimated category mean 
of the corresponding reference category. 


Appendix B Singular Design Matrices 


Stata is a commonly used statistical software package that can implement lasso regression,
and in Stata categorical predictors are handled by including a singular design matrix
(StataCorp, 2019). In this section, we examine this alternative method for creating the
design matrices for categorical variables. When we introduced categorical variables, we
noted that for a variable with k categories, k - 1 indicators are created for this variable.
Different coding strategies use different matrices to represent the k - 1 indicators, and
model coefficients represent differences between categories and the reference value, as
is common practice for linear regression. The researcher must then choose the reference
category for analysis. However, there is another way to create the design matrix
for categorical predictors in which the researcher does not need to explicitly choose the
reference category. Instead of using k - 1 indicators for a categorical variable with k
categories, we use k indicators. This design matrix allows lasso or group lasso to
essentially select the reference values. Mathematically, this type of design matrix is called
singular because, once the intercept is included, its columns are linearly dependent and
the associated cross-product matrix is not invertible. Singular design matrices cannot be
used for linear regression, because the least-squares coefficients are then not uniquely
determined, but lasso and group lasso regression can accommodate singular design
matrices, making this a unique potential solution to the variable selection
and prediction accuracy issues related to categorical variables in lasso and group lasso.
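
The rank deficiency can be checked directly. The sketch below uses a hypothetical variable with k = 4 categories and plain 0/1 indicators (an identity-style system); it is meant only to illustrate why k indicators plus an intercept are linearly dependent, not to reproduce the design matrices used in this appendix.

```python
import numpy as np

k = 4                                  # hypothetical number of categories
groups = np.repeat(np.arange(k), 3)    # three observations per category
n = groups.size

# One 0/1 indicator column per category (n x k).
indicators = (groups[:, None] == np.arange(k)[None, :]).astype(float)
intercept = np.ones((n, 1))

# Usual design: intercept plus k - 1 indicators (category 0 as reference).
X_reference = np.hstack([intercept, indicators[:, 1:]])
# Alternative design: intercept plus all k indicators.
X_singular = np.hstack([intercept, indicators])

rank_reference = int(np.linalg.matrix_rank(X_reference))
rank_singular = int(np.linalg.matrix_rank(X_singular))
print("reference design: columns =", X_reference.shape[1], "rank =", rank_reference)
print("singular design:  columns =", X_singular.shape[1], "rank =", rank_singular)
```

The k indicator columns sum to the intercept column, so the second design has one more column than its rank and its cross-product matrix cannot be inverted.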

Can singular design matrices solve the inconsistency in lasso’s variable selection
and prediction accuracy or group lasso’s prediction accuracy across coding strategies?
To create singular design matrices, we appended a linearly independent column with
a 1 only in the first row to the matrices in Tables 1 and 3, and a linearly independent
column with a 1 only in the last row to the matrices in Tables 2 and 4. If using singular
design matrices solves the issues of variable selection and prediction accuracy, these two
properties should be equivalent across these four design matrices. To test this, we used the
same data set and applied the same process as before to fit lasso and group lasso models
with Education serving as the only predictor variable. Table 14 shows the coefficients
of the categorical variable Education in lasso models as an example of lasso’s variable
selection. Even with a singular design matrix for categorical variables, different coding
strategies still lead to different variable selection in lasso models. In contrast, group
lasso selected all categories and performed the same variable selection under every
coding strategy. For example, the contrast-coded lasso model treated the 9 years of
education and PhD categories as the same, while these two categories were always treated
as different in the other three lasso models and the four group lasso models. In addition,
lasso and group lasso models using different coding strategies led to different prediction
accuracies, as shown in Tables 15 and 16. This means that using singular design matrices
does not solve the inconsistent variable selection or prediction accuracy for lasso, nor does
it solve the inconsistency in prediction accuracy for group lasso. There are infinitely many
singular design matrices that could be used, and if they all result in different solutions, this
does not provide strong evidence that the identity matrix system used by StataCorp (2019)
would perform optimally.


Table 14: Model Coefficients Using Singular Design Matrix with Lasso

Coding strategies   Dummy    Contrast  Sequential  Helmert
Intercept           2.991    2.961     2.873       2.961
1. no               -0.109   -0.081    0.015       0.103
2. 6 years          -0.086   -0.063    0.076       0.095
3. 9 years          -0.006   0         0.034       0.023
4. 12 years         0        0.034     0.010       0
5. some college     0.016    0.051     -0.017      -0.017
6. college degree   0        0.028     0           0
7. PhD              0        0         -0.009      0


Note. Each column represents one model, and each row represents the coefficient for an 
indicator of Education produced by the corresponding model. A 0 means that lasso does not 
select the corresponding category into the model. 
Table 15: Predicted Category Means and Prediction Accuracy for Different Coding
Strategies Using Singular Design Matrices with Lasso

                            Coding Strategy
Category        Dummy    Contrast  Sequential  Helmert  Observed Mean
None            2.882    2.881     2.864       2.873    2.852
6 years         2.905    2.898     2.888       2.897    2.883
9 years         2.986    2.961     2.964       2.973    2.962
12 years        2.991    2.995     2.999       2.997    2.997
Some college    3.007    3.012     3.008       3.008    3.013
College         2.991    2.990     2.991       2.991    2.990
PhD/Doctorate   2.991    2.992     2.991       2.991    2.991
MSE             0.15630  0.15620   0.15616     0.15621  /


Note. Each column (besides the last) represents one model, and each row (besides the last) 
represents the predicted category means for Education produced by each model. The last 
column contains the category means observed in the training data set. The last row contains the 
MSE of the corresponding model. 


Table 16: Predicted Category Means and Prediction Accuracy for Different Coding
Strategies Using Singular Design Matrices with Group Lasso

                            Coding Strategy
Category        Dummy    Contrast  Sequential  Helmert  Observed Mean
None            2.873    2.869     2.878       2.871    2.852
6 years         2.892    2.890     2.909       2.891    2.883
9 years         2.962    2.961     2.955       2.962    2.962
12 years        2.996    2.996     2.994       2.996    2.997
Some college    3.012    3.012     3.010       3.012    3.013
College         2.990    2.990     2.991       2.990    2.990
PhD/Doctorate   2.990    2.991     2.991       2.990    2.991
MSE             0.15619  0.15619   0.15622     0.15619  /


Note. Same as Table 15.


